1 Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.

library(fivethirtyeight)
data(drinks)

After skimming the data, wer can see that there are no missing values in the dataset. The variable types are:

Column name country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
Variable type character integer integer integer double
glimpse(drinks)
## Rows: 193
## Columns: 5
## $ country                      <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings                <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings              <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings                <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …
skim(drinks)
Data summary
Name drinks
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁

We will now make a plot that shows the top 25 beer consuming countries-

drinks %>% 
  slice_max ( order_by = beer_servings, n=25 ) %>% # taking top 25 countries by servings
  #plotting a graph 
  ggplot(aes(x = beer_servings, y = fct_reorder(country, beer_servings))) +
  geom_col(fill="orange") +
  #labelling the graph and the axes
  labs(
    title = "Top 25 Beer Consuming Countries in 2010",
    subtitle = "Standard Servings Per Person",
    x = "Beer Servings (in cans)",
    y = "Country"
  )

We will now make a plot that shows the top 25 wine consuming countries-

drinks %>% 
  slice_max ( order_by = wine_servings, n=25 ) %>% # taking top 25 countries by servings
  ggplot(aes(x = wine_servings, y = fct_reorder(country, wine_servings))) +
  geom_col(fill="dark red") +
  labs(
    title = "Top 25 Wine Consuming Countries in 2010",
    subtitle = "Standard Servings Per Person",
    x = "Wine Servings (in glasses)",
    y = "Country"
  )

Finally, make a plot that shows the top 25 spirit consuming countries

drinks %>% 
  slice_max ( order_by = spirit_servings, n=25 ) %>% # taking top 25 countries by servings
  ggplot(aes(x = spirit_servings, y = fct_reorder(country, spirit_servings))) +
  geom_col(fill="grey") +
  labs(
    title = "Top 25 Spirit Consuming Countries in 2010",
    subtitle = "Servings (in shots) Per Person",
    x = "Spirit Servings",
    y = "Country"
  )

> Inferences & Key takeaways-

  1. European countries are high consumers of wine.

  2. Beer is more evenly distributed around the world in the top 25 countries, as compared to wine and spirit.

  3. European countries are higher ranked for overall consumption of beer, wine and spirit.

  4. We see the countries with higher population are lower on the graphs, since the data is plotted per population. Hence the distribution of ages in the population within each country will also affect the alcohol, wine and spirits consumption.

2 Analysis of movies- IMDB dataset

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of facebook likes cast members received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

2.1 Import, inspection, and cleaning of dataset-

From the dataset, we can see that: 1. There are no missing values in the dataset. 2. The movies with duplicated entries. There are Duplicates (2907 distinct titles in 2961 rows) The following piece of code shows how we have cleaned it.

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
# no missing values. There are Duplicates (2907 distinct titles in 2961 rows). 
skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁
# show the duplicate movies
movies %>% count(title, sort=T)
## # A tibble: 2,907 × 2
##    title                           n
##    <chr>                       <int>
##  1 Home                            3
##  2 A Nightmare on Elm Street       2
##  3 Across the Universe             2
##  4 Alice in Wonderland             2
##  5 Aloha                           2
##  6 Around the World in 80 Days     2
##  7 Brothers                        2
##  8 Carrie                          2
##  9 Chasing Liberty                 2
## 10 Cinderella                      2
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Homes")
## # A tibble: 0 × 11
## # … with 11 variables: title <chr>, genre <chr>, director <chr>, year <dbl>,
## #   duration <dbl>, gross <dbl>, budget <dbl>, cast_facebook_likes <dbl>,
## #   votes <dbl>, reviews <dbl>, rating <dbl>
# `distinct` function can only keep the first entry but not latest
# movies <- distinct(movies, title, .keep_all=T)
length(unique(movies$title))
## [1] 2907
movies <- movies %>% 
  group_by(title) %>% 
  filter(votes == max(votes)) %>%
  ungroup()

# there are still duplicates
movies %>% count(title, sort=T)
## # A tibble: 2,907 × 2
##    title                          n
##    <chr>                      <int>
##  1 Chasing Liberty                2
##  2 10 Cloverfield Lane            1
##  3 10 Days in a Madhouse          1
##  4 10 Things I Hate About You     1
##  5 102 Dalmatians                 1
##  6 10th & Wolf                    1
##  7 12 Rounds                      1
##  8 12 Years a Slave               1
##  9 127 Hours                      1
## 10 13 Going on 30                 1
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Chasing Liberty")
## # A tibble: 2 × 11
##   title    genre  director   year duration   gross budget cast_facebook_l… votes
##   <chr>    <chr>  <chr>     <dbl>    <dbl>   <dbl>  <dbl>            <dbl> <dbl>
## 1 Chasing… Comedy Andy Cad…  2004      101  1.22e7  2.3e7              842 30092
## 2 Chasing… Comedy Andy Cad…  2004      101  1.22e7  2.3e7              829 30092
## # … with 2 more variables: reviews <dbl>, rating <dbl>
# do the filter only for the entries of Chasing Liberty 
movies <- movies %>%
  group_by(title) %>% 
  filter(cast_facebook_likes==max(cast_facebook_likes)) %>%
  ungroup()

skim(movies)
Data summary
Name movies
Number of rows 2907
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.92e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.23e+01 37.0 9.50e+01 1.05e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.76e+07 7.23e+07 703.0 1.20e+07 3.45e+07 7.51e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.02e+07 4.32e+07 218.0 1.10e+07 2.50e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.23e+04 2.05e+04 0.0 2.22e+03 4.54e+03 1.68e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.59e+05 5.0 1.95e+04 5.47e+04 1.32e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 4.98e+02 4.93e+02 2.0 1.97e+02 3.58e+02 6.24e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.06e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

The following table shows the count of movies by genre, ranked in descending order

movies %>% count(genre, sort = TRUE)
## # A tibble: 17 × 2
##    genre           n
##    <chr>       <int>
##  1 Comedy        844
##  2 Action        719
##  3 Drama         484
##  4 Adventure     281
##  5 Crime         198
##  6 Biography     135
##  7 Horror        128
##  8 Animation      35
##  9 Fantasy        26
## 10 Documentary    25
## 11 Mystery        15
## 12 Sci-Fi          7
## 13 Family          3
## 14 Musical         2
## 15 Romance         2
## 16 Western         2
## 17 Thriller        1

Here we have a table with the average gross earning and budget (gross and budget) by genre. We have calculated a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. We have ranked genres by this return_on_budget in descending order-

movies %>% 
  mutate(movies_return = gross/budget ) %>%
  group_by(genre) %>%
  summarise(avg_gross = mean(gross),
            avg_budget = mean(budget),
            genre_return_on_budget = sum(gross)/sum(budget),
            movie_mean_return_on_budget = mean(movies_return)) %>%
  arrange(-movie_mean_return_on_budget)
## # A tibble: 17 × 5
##    genre        avg_gross avg_budget genre_return_on_budget movie_mean_return_o…
##    <chr>            <dbl>      <dbl>                  <dbl>                <dbl>
##  1 Horror       37782310.  13804379.                2.74                86.1    
##  2 Biography    45201805.  28543696.                1.58                22.3    
##  3 Musical      92084000    3189500                28.9                 18.8    
##  4 Family      149160478.  14833333.               10.1                 14.1    
##  5 Documentary  17353973.   5887852.                2.95                 8.70   
##  6 Western      20821884    3465000                 6.01                 7.06   
##  7 Fantasy      41902674.  18484615.                2.27                 6.10   
##  8 Animation    98433792.  61701429.                1.60                 5.01   
##  9 Comedy       42487808.  24458506.                1.74                 3.70   
## 10 Romance      31264848.  25107500                 1.25                 3.17   
## 11 Drama        36754959.  25832605.                1.42                 2.98   
## 12 Mystery      69117136.  41500000                 1.67                 2.90   
## 13 Adventure    94350236.  64692313.                1.46                 2.44   
## 14 Crime        37601525.  26527405.                1.42                 2.19   
## 15 Action       86270343.  70774558.                1.22                 1.93   
## 16 Sci-Fi       29788371.  27607143.                1.08                 1.58   
## 17 Thriller         2468     300000                 0.00823              0.00823

Here we have a table that shows the top 15 directors who have created the highest gross revenue in the box office. We have shown the total gross amount, the mean, median, and standard deviation per director.

movies %>%
  group_by(director) %>%
  summarise(total_gross = sum(gross),
            mean_gross = mean(gross),
            median_gross = median(gross),
            standard_dev_gross = sd(gross)) %>%
  slice_max ( order_by = total_gross, n = 15)
## # A tibble: 15 × 5
##    director          total_gross mean_gross median_gross standard_dev_gross
##    <chr>                   <dbl>      <dbl>        <dbl>              <dbl>
##  1 Steven Spielberg   4014061704 174524422.   164435221          101421051.
##  2 Michael Bay        2195443511 182953626.   168468240.         125789167.
##  3 James Cameron      1909725910 318287652.   175562880.         309171337.
##  4 Christopher Nolan  1813227576 226653447    196667606.         187224133.
##  5 George Lucas       1741418480 348283696    380262555          146193880.
##  6 Robert Zemeckis    1619309108 124562239.   100853835           91300279.
##  7 Tim Burton         1557078534 111219895.    69791834           99304293.
##  8 Sam Raimi          1443167519 180395940.   138480208          174705230.
##  9 Clint Eastwood     1378321100  72543216.    46700000           75487408.
## 10 Francis Lawrence   1358501971 271700394.   281666058          135437020.
## 11 Ron Howard         1335988092 111332341    101587923           81933761.
## 12 Gore Verbinski     1329600995 189942999.   123207194          154473822.
## 13 Andrew Adamson     1137446920 284361730    279680930.         120895765.
## 14 Shawn Levy         1129750988 102704635.    85463309           65484773.
## 15 Ridley Scott       1128857598  80632686.    47775715           68812285.

We have produced a table that describes how ratings are distributed by genre. The histogram visually shows how ratings are distributed.

movies_rating <- movies %>%
  group_by(genre) %>%
  summarise(mean_rating = mean(rating),
            min_rating = min(rating),
             max_rating = max(rating),
             sd_rating = sd(rating)) 
movies_rating
## # A tibble: 17 × 5
##    genre       mean_rating min_rating max_rating sd_rating
##    <chr>             <dbl>      <dbl>      <dbl>     <dbl>
##  1 Action             6.23        2.1        9       1.04 
##  2 Adventure          6.51        2.3        8.6     1.11 
##  3 Animation          6.65        4.5        8       0.968
##  4 Biography          7.11        4.5        8.9     0.760
##  5 Comedy             6.11        1.9        8.8     1.02 
##  6 Crime              6.92        4.8        9.3     0.853
##  7 Documentary        6.66        1.6        8.5     1.77 
##  8 Drama              6.74        2.1        8.8     0.915
##  9 Family             6.5         5.7        7.9     1.22 
## 10 Fantasy            6.08        4.3        7.9     0.953
## 11 Horror             5.79        3.6        8.5     0.987
## 12 Musical            6.75        6.3        7.2     0.636
## 13 Mystery            6.84        4.6        8.5     0.910
## 14 Romance            6.65        6.2        7.1     0.636
## 15 Sci-Fi             6.66        5          8.2     1.09 
## 16 Thriller           4.8         4.8        4.8    NA    
## 17 Western            5.7         4.1        7.3     2.26
movies %>%
  ggplot(mapping = aes(x = rating)) + 
  geom_histogram(bins=30) +
  facet_wrap(~genre)+
  labs(title = "Distribution of ratings in each genre",
       x = "Rating (1-10)",
       y = "Num of movies") +
  NULL

2.2 Using ggplot to find relationships between variables

Understanding the correlation between gross and cast_facebook_likes. We have produced a scatterplot with Facebook Likes on the X-Axis and Gross Revenue on the Y-Axis.

ggplot(movies, aes(x = cast_facebook_likes, y = gross)) +
  geom_point() +
  geom_smooth(method = "lm")+
   labs(
    title = "Relationship of Facebook Likes vs Gross Revenue of the Movie",
    x = "Facebook Likes",
    y = "Gross Revenue"
  )+
  NULL

We analyze the following from the graph below-

  1. Facebook likes do not seem like a good indicator of the gross as there is no direct correlation as seen from the scatter plot.
  2. We mapped gross to Y axes and number of facebook likes to X, because the gross is the final outcome of a movie, aka dependent variable.

Now we examine the relationship between gross and budget by creating a scatterplot.

ggplot(movies, aes(x = budget , y = gross)) +
  geom_point() +
  geom_smooth(method = "lm") +
   labs(
    title = "Relationship of Gross Revenue vs Budget of the Movie",
    x = "Movie Budget",
    y = "Gross Revenue"
  )+
  NULL

From the plot above we see that, the budget and gross do seem correlated. The higher the budget, it is more likely that the gross may be higher.

Furthermore, we examine the relationship between gross and rating. Segmenting the scatterplot by ‘genre’, we can see the following results-

ggplot(movies, aes(x = rating , y = gross)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~genre) +
  labs(title = "Gross vs Rating of Movies For Each Genre ",
       x = "Rating",
       y = "Gross") +
  NULL

We can see that:

  • The higher the rating the more will be the gross for the most genres of movies.

  • For movies of some genres like ‘Documentary’, ‘Mystery’, ‘Horror’ and ‘Sci-Fi’, the gross has a very less change with respect to rating. Documentaries certainly have a different business model.

  • Negative correlation even appears.

  • Sample size of genres like ‘Family’, ‘Romance’ , ‘Musical’ is very small with under three values.

3 Returns of financial stocks

We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.

We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

nyse <- read_csv(here::here("data","nyse.csv"))

glimpse(nyse)
## Rows: 508
## Columns: 6
## $ symbol        <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "…
## $ name          <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie …
## $ ipo_year      <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999…
## $ sector        <chr> "Health Care", "Consumer Durables", "Health Care", "Heal…
## $ industry      <chr> "Medical/Dental Instruments", "Electrical Products", "Ma…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq…

Based on this dataset, we have created a table and a bar plot that shows the number of companies per sector, in descending order

# a easier way
# nyse %>% 
#   select(sector) %>% 
#   table() %>%
#   sort(decreasing = T) %>%
#   barplot()

nyse %>%
  group_by(sector) %>% 
  mutate(company_num = count(sector)) %>%
  ggplot(aes(x=company_num, y=fct_reorder(sector, company_num))) +
  geom_bar(stat="identity") +
  labs(title = "Number of Companies in each Sector",
       x = "",
       y = "Sector")

Next, we have chosen some stocks and their ticker symbols and downloaded data. The stocks we chose are: “EBR”,“BBL”,“AEE”,“BCE”,“BRO”,“CAT”,“BUD”,“SPY”

# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("EBR","BBL","AEE","BCE","BRO","CAT","BUD","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2021-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 21,464
## Columns: 8
## Groups: symbol [8]
## $ symbol   <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 13.9, 14.1, 14.3, 14.3, 14.0, 13.8, 13.8, 13.8, 13.9, 14.0, 1…
## $ high     <dbl> 14.0, 14.3, 14.5, 14.3, 14.0, 13.9, 13.8, 13.9, 14.1, 14.3, 1…
## $ low      <dbl> 13.9, 14.0, 14.3, 14.0, 13.8, 13.7, 13.6, 13.7, 13.8, 13.9, 1…
## $ close    <dbl> 14.0, 14.2, 14.4, 14.1, 14.0, 13.7, 13.7, 13.8, 14.0, 14.2, 1…
## $ volume   <dbl> 878300, 802900, 950300, 855800, 586000, 1020300, 1545800, 112…
## $ adjusted <dbl> 9.85, 10.00, 10.13, 9.92, 9.82, 9.63, 9.65, 9.71, 9.84, 10.03…

We can see that the dataset is 8x21,464 tibble with each row being the ohlc (open,high,low,close) and volume for a stock on a given date.

Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  

#calculate monthly  returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))

We have created a table summarising monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

glimpse(myStocks_returns_monthly)
## Rows: 1,024
## Columns: 3
## Groups: symbol [8]
## $ symbol          <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date            <date> 2011-01-31, 2011-02-28, 2011-03-31, 2011-04-29, 2011-…
## $ monthly_returns <dbl> -0.0264, 0.0487, 0.0861, -0.0445, -0.0263, -0.0264, -0…
monthlystocks_summarised <- myStocks_returns_monthly %>% 
  group_by(symbol) %>%
  summarise(min_return = min(monthly_returns),
            max_return = max(monthly_returns),
            median_return = median(monthly_returns),
            mean_return = mean(monthly_returns),
            sd_return = sd(monthly_returns))

Plotted a density plot, using geom_density(), for each of the stocks

ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
  geom_density() + 
  facet_wrap(~symbol) +
  labs(
    title = "Density Plots for Monthly Returns of Stocks",
    x = "Monthly Return",
    y = "Density"
  ) +
  NULL

Inferences from the plots

The monthly returns for the flatter density plots are more dispersed whereas those with tall peaks are more concentrated. The riskiest stock is EBR and the least risky is SPY (as an ETF) due to the shape of their peaks.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock

monthlystocks_summarised%>%
  ggplot(aes(y = mean_return, x=sd_return)) +
  geom_point() + 
  ggrepel::geom_text_repel(aes(label = symbol)) +
  NULL

Inferences from the plots

EBR is the most risky as it has the highest standard deviation of returns. EBR has the highest sd but also a high expected return. AEE, CAT and SPY also have high expected returns with less standard deviation. BBL and BUD have lower expected returns and higher standard deviations meaning they are riskier and do not have high expected returns. SPY produces a good return with lower risk.

4 IBM HR Analytics

We analyse a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.

First let us load the data:

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

Cleaning the data:

hr_cleaned <- hr_dataset %>% 
  clean_names() %>% 
  mutate(
    education = case_when(
      education == 1 ~ "Below College",
      education == 2 ~ "College",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Master",
      education == 5 ~ "Doctor"
    ),
    environment_satisfaction = case_when(
      environment_satisfaction == 1 ~ "Low",
      environment_satisfaction == 2 ~ "Medium",
      environment_satisfaction == 3 ~ "High",
      environment_satisfaction == 4 ~ "Very High"
    ),
    job_satisfaction = case_when(
      job_satisfaction == 1 ~ "Low",
      job_satisfaction == 2 ~ "Medium",
      job_satisfaction == 3 ~ "High",
      job_satisfaction == 4 ~ "Very High"
    ),
    performance_rating = case_when(
      performance_rating == 1 ~ "Low",
      performance_rating == 2 ~ "Good",
      performance_rating == 3 ~ "Excellent",
      performance_rating == 4 ~ "Outstanding"
    ),
    work_life_balance = case_when(
      work_life_balance == 1 ~ "Bad",
      work_life_balance == 2 ~ "Good",
      work_life_balance == 3 ~ "Better",
      work_life_balance == 4 ~ "Best"
    )
  ) %>% 
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)
glimpse(hr_cleaned)
## Rows: 1,470
## Columns: 19
## $ age                        <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ attrition                  <chr> "Yes", "No", "Yes", "No", "No", "No", "No",…
## $ daily_rate                 <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department                 <chr> "Sales", "Research & Development", "Researc…
## $ distance_from_home         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education                  <chr> "College", "Below College", "College", "Mas…
## $ gender                     <chr> "Female", "Male", "Male", "Female", "Male",…
## $ job_role                   <chr> "Sales Executive", "Research Scientist", "L…
## $ environment_satisfaction   <chr> "Medium", "High", "Very High", "Very High",…
## $ job_satisfaction           <chr> "Very High", "Medium", "High", "High", "Med…
## $ marital_status             <chr> "Single", "Married", "Single", "Married", "…
## $ monthly_income             <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ num_companies_worked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ percent_salary_hike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating         <chr> "Excellent", "Outstanding", "Excellent", "E…
## $ total_working_years        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ work_life_balance          <chr> "Bad", "Better", "Better", "Better", "Bette…
## $ years_at_company           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
  1. How often do people leave the company (attrition)
# 1233 employees stay while 237 left (19.2%).
hr_cleaned %>% 
  group_by(attrition)%>% 
  summarise(count = count(attrition))
## # A tibble: 2 × 2
##   attrition count
##   <chr>     <int>
## 1 No         1233
## 2 Yes         237
# to see the how attrition rate changes with years at company
prop.table(table(hr_cleaned[,c("years_at_company","attrition")]),1)[,2]%>%
  barplot(main="Attrition Rate vs Years At Company" , xlab="Years at Company" , ylab="Attrition Rate")
abline(h=0.192,col="red") # avergae attrition rate

  • As we can see from the above table only around 20% of the employees in the dataset left the company during their working years. This shows that employees do not leave that often.
  1. After analyzing the age, years_at_company, monthly_income and years_since_last_promotion, we can see from the histograms from the summary statistics, the only variable that looks normally distributed is age.
skim(hr_cleaned)
Data summary
Name hr_cleaned
Number of rows 1470
Number of columns 19
_______________________
Column type frequency:
character 10
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
attrition 0 1 2 3 0 2 0
department 0 1 5 22 0 3 0
education 0 1 6 13 0 5 0
gender 0 1 4 6 0 2 0
job_role 0 1 7 25 0 9 0
environment_satisfaction 0 1 3 9 0 4 0
job_satisfaction 0 1 3 9 0 4 0
marital_status 0 1 6 8 0 3 0
performance_rating 0 1 9 11 0 2 0
work_life_balance 0 1 3 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 36.92 9.14 18 30 36 43 60 ▂▇▇▃▂
daily_rate 0 1 802.49 403.51 102 465 802 1157 1499 ▇▇▇▇▇
distance_from_home 0 1 9.19 8.11 1 2 7 14 29 ▇▅▂▂▂
monthly_income 0 1 6502.93 4707.96 1009 2911 4919 8379 19999 ▇▅▂▁▂
num_companies_worked 0 1 2.69 2.50 0 1 2 4 9 ▇▃▂▂▁
percent_salary_hike 0 1 15.21 3.66 11 12 14 18 25 ▇▅▃▂▁
total_working_years 0 1 11.28 7.78 0 6 10 15 40 ▇▇▂▁▁
years_at_company 0 1 7.01 6.13 0 3 5 9 40 ▇▂▁▁▁
years_since_last_promotion 0 1 2.19 3.22 0 0 1 3 15 ▇▁▁▁▁
  1. We now analyse the distirbutions of job_satisfaction and work_life_balance
hr_cleaned %>%  
  group_by(job_satisfaction)%>% 
  summarise(countjs = n(),
            percentagejs = countjs/nrow(hr_cleaned)*100)
## # A tibble: 4 × 3
##   job_satisfaction countjs percentagejs
##   <chr>              <int>        <dbl>
## 1 High                 442         30.1
## 2 Low                  289         19.7
## 3 Medium               280         19.0
## 4 Very High            459         31.2
hr_cleaned %>%  
  group_by(work_life_balance)%>% 
  summarise(countwlb= n(),
            percentagewlb = round(countwlb/nrow(hr_cleaned)*100,2))
## # A tibble: 4 × 3
##   work_life_balance countwlb percentagewlb
##   <chr>                <int>         <dbl>
## 1 Bad                     80          5.44
## 2 Best                   153         10.4 
## 3 Better                 893         60.8 
## 4 Good                   344         23.4
  • Job satisfaction is distributed quite evenly however the categories of High and Very High are more common with around 30% each. In terms of work life balance very few people have Bad or the Best work life balance. The majority of people have better work life balance.
  1. We now check for the relationship between monthly income vs education and Monthly income vs gender
ggplot(hr_cleaned, aes(x = fct_relevel(education, 
            "Doctor", "Master", "Bachelor", 
            "College", "Below College"), y = monthly_income)) +
  geom_boxplot()+
  labs(title = "Boxplot of Monthly Income against Education",
       x = "Education",
       y = "Monthly Income")+
  NULL

ggplot(hr_cleaned, aes(x = monthly_income, y = gender)) +
  geom_boxplot()+
  labs(title = "Boxplot of Monthly Income against Gender",
       x = "Monthly Income",
       y = "Gender")+
  NULL

  • As we can see from the boxplot of males and females and their monthly income, females have a higher median monthly income which could mean that the females in the dataset could be more educated. The doctors have the highest monthly median income and those with below college education have the least. Doctors also have the highest variability in the IQR. College has the most outliers due to its low standard deviation.
  1. We now plot a boxplot of income vs job role. The highest-paid job roles appear first.
ggplot(hr_cleaned, aes(x=fct_reorder(job_role,-monthly_income), y=monthly_income)) +
  geom_boxplot() +
  labs(title = "Boxplot of Monthly Income against Job Role",
       x = "Job Role",
       y = "Monthly Income")+
  NULL

  1. Calculate and plot a bar chart of the mean (or median?) income by education level.
hr_cleaned %>% 
  group_by(education) %>%
  summarise(medianinc = median(monthly_income),
            meaninc = mean(monthly_income)) %>% 
  ggplot(aes(x = fct_relevel(education, 
            "Doctor", "Master", "Bachelor", 
            "College", "Below College"),
            y=meaninc)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Monthly Income of Each Education Level",
       x = "Education",
       y = "Average Monthly Income")+
  NULL

  • As we can see the medians and means differ greatly, this is due to the outliers in the dataset with individuals with abnormally high monthly income levels.
  1. We plot the distribution of income by education level.
hr_cleaned %>% 
  ggplot(aes(x=monthly_income)) +
  geom_histogram(bins=30)+
  facet_wrap(~fct_relevel(education, 
            "Doctor", "Master", "Bachelor", 
            "College", "Below College")) +
  theme_wsj() +
  NULL

  1. Graph showing income vs age, faceted by job_role
hr_cleaned %>% 
  ggplot(aes(y=monthly_income, x=age)) +
  geom_point() +
  geom_smooth(method="lm")+
  facet_wrap(~job_role) +
  theme_wsj() +
  NULL

5 Challenge 1: COVID Vaccination Data

The purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.

# 3154 unique fips code in election2020_results
length(unique(election2020_results$fips))
## [1] 3154
# check unique values of candidate names
unique(election2020_results$candidate)
## [1] "JOSEPH R BIDEN JR" "OTHER"             "DONALD J TRUMP"   
## [4] "JO JORGENSEN"
data <- election2020_results %>% 
  mutate(votes_percentage = candidatevotes/totalvotes) %>% # calculate percentage
  filter(candidate=="DONALD J TRUMP", mode=="TOTAL") %>% # we only need Trump votes 
  inner_join(vaccinations,by = "fips") %>% # inner join with vaccinations data
  inner_join(population,by = "fips") # inner join with population data


# generate graph below
# install.packages("ggpubr") # to show equation easily
library(ggpubr)
# calculate actual vaccination percentage
actual = sum(data$series_complete_yes)/sum(data$pop_estimate_2019)
data %>% 
  # filter out 0% vaccinated points
  filter(series_complete_pop_pct>0) %>%
  ggplot() +
  geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100,size=pop_estimate_2019),shape=21,fill="light blue")+
  
  # scale circle size
  scale_size(range = c(0, 20)) + 
  
  # add points in the center of circles
  geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100),size=0.5)+ 
  
  # add regression line
  geom_smooth(aes(x=votes_percentage,y=series_complete_pop_pct/100),
              method="lm",linetype="dotted",colour="blue",se=FALSE)+ 
  
  # add the equation of regression line
  stat_regline_equation(aes(x=votes_percentage,y=series_complete_pop_pct/100),
                        label.y = 0.1,colour="red",fontface = "bold.italic") + 
  
  # add r square
  stat_cor(aes(x=votes_percentage,y=series_complete_pop_pct/100,label = paste(..rr.label..)),
           label.y = 0.05,colour="red")+ 
  
  # add date
  geom_text(aes(x=0.3, y=0.07,label = "09/03/2021", 
                fontface = "bold.italic"),colour="red")+ 
  
  # add horizonal lines below
  geom_hline(aes(yintercept=0.85), linetype=2) + # herd immunity line
  geom_text(aes(x=0, y=0.85, label = "Herd Immunity threshold (?)", 
                vjust=-1, hjust=0, fontface = "bold.italic"),colour="blue") +
  geom_hline(aes(yintercept=0.539), linetype=2) + # Target line
  geom_text(aes(x=0, y=0.539,label = "TARGET: 53.9%", 
                vjust=-1,hjust=0, fontface = "bold.italic"),colour="blue") +
  geom_hline(aes(yintercept=actual), linetype=2) + # actual line
  geom_text(aes(x=0, y=actual, label = paste("ACTUAL: ", round(actual*100,1),"%"),
                vjust=-1,hjust=0,fontface = "bold.italic"),
            colour="blue") +
  
  # adjuct grid lines
  scale_x_continuous(label=scales::percent_format(accuracy = 1),
                     breaks=seq(0,1,0.05),limits = c(0,1))+
  scale_y_continuous(label=scales::percent_format(accuracy = 1),
                     breaks=seq(0,1,0.05),limits = c(0,1))+
  geom_text(aes(x=0.5,y=1, label = "EACH U.S. COUNTY",vjust=0.5,hjust=0.5,
                family="mono",fontface = "bold"),
            size=5)+
  labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
       x = "2020 Trump Vote %",
       y = "% of Total Population Vaccinated")+
  # remove the legend
  theme(legend.position = "none")+
  NULL

6 Challenge 2: Opinion polls for the 2021 German elections

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.

The following code will scrape the wikipedia page and import the table in a dataframe.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>%  # drop the first row, as it contains again the variable names and last row that contains 2017 results
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )


german_election_polls %>% 
  ggplot() +
  # union
  geom_point(aes(x=end_date,y=union, colour="UNION"),shape=21)+
  # span=0.2 to make the line less smoothed
  geom_smooth(aes(x=end_date,y=union, colour="UNION"),se=F,span = 0.2)+ 
  # spd
  geom_point(aes(x=end_date,y=spd, colour="SPD"),shape=21)+
  geom_smooth(aes(x=end_date,y=spd,colour="SPD"),se=F,span = 0.2)+
  # afd
  geom_point(aes(x=end_date,y=af_d,colour="AFD"),shape=21)+
  geom_smooth(aes(x=end_date,y=af_d,colour="AFD"),se=F,span = 0.2)+
  #fdp
  geom_point(aes(x=end_date,y=fdp,colour="FDP"),shape=21)+
  geom_smooth(aes(x=end_date,y=fdp,colour="FDP"),se=F,span = 0.2)+
  #grune
  geom_point(aes(x=end_date,y=grune, colour="GRUNE"),shape=21)+
  geom_smooth(aes(x=end_date,y=grune,colour="GRUNE"),se=F,span = 0.2)+
  #linke
  geom_point(aes(x=end_date,y=linke,colour="LINKE"),shape=21)+
  geom_smooth(aes(x=end_date,y=linke,colour="LINKE"),se=F,span = 0.2)+
  #display every month
  scale_x_date(date_labels="%b %y",date_breaks  ="1 month")+
  labs(
    x="Date",
    y="Votes %"
  )+
  scale_colour_manual("", 
                      breaks = c("UNION","SPD","AFD","FDP","GRUNE","LINKE"),
                      values = c("black","red","blue","yellow","dark green","purple"))+
  NULL

7 Details

Team Members: Alex Kubbinga, Clara Moreno Sanchezu, Jean Huang, Raghav Mehta, Raina Doshi, Yuan Gao

Time Spent: 8 hours

Additional Information: We have gone through online documentation available on google and stackoverflow.

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.